Abstract

Decision-theoretic troubleshooting is one of the application areas of Bayesian networks.
Given a probabilistic model of a malfunctioning man-made device, the task is to construct a
repair strategy with minimal expected cost. The problem has received considerable attention
over the past two decades. Efficient solution algorithms have been found for simple cases,
whereas other variants have been proven NP-complete. We study several variants of the
problem found in the literature, and prove that computing approximate troubleshooting
strategies is NP-hard. In the proofs, we exploit a close connection to set-covering problems.

1 Introduction 



In decision-theoretic troubleshooting [Breese and Heckerman, 1996], we are given a probabilistic
model of a man-made device. The model describes faults, repair actions and diagnostic actions
addressing the faults. Knowing that the modeled device is in a faulty state, the task is to find
the most cost-efficient strategy for fixing the device with available repair and diagnostic actions.
This is a natural optimization problem that has been studied independently in various contexts
since the early days of computing [Johnson, 1956, Bellman, 1957, Gluss, 1959].



Troubleshooting has become one of the application areas of Bayesian networks [Breese and Heckerman,
1996, Jensen et al., 2001] with interesting algorithmic problems and results [Ottosen, 2012, ...]. The
troubleshooting problem is known to be solvable in polynomial time under quite restrictive
assumptions (to be discussed in Section 2). When the assumptions are relaxed, the problem is
NP-hard [Vomlelová, 2003]. Efficient heuristics exist that yield close-to-optimal results in practice
[Jensen et al., 2001, Gökçay and Bilgiç, 2002]. Search algorithms for computing optimal
troubleshooting strategies are described in [Vomlelová and Vomlel, 2003]. However, we provide
negative results showing that approximating the optimal solutions is in general a hard problem.
negative results showing that approximating the optimal solutions is in general a hard problem. 



Our Contribution We solve an open problem suggested by [Ottosen, 2012]: we show that
troubleshooting with cost clusters (to be defined below) forming an acyclic directed graph is
NP-complete, and that it is NP-hard to approximate. We improve upon known NP-completeness
results [Vomlelová, 2003] by showing hardness of approximation for troubleshooting scenarios
containing multiple dependent faults, dependent actions or questions.



Organization of the paper In Section 2 we overview the various setups of the troubleshooting
problem. Precise statements of our results are in Section 3.1. The proofs are presented in
Section 3.2. The proofs utilize reductions from the Min-sum set cover problem [Feige et al.,
2004] and the Decision tree problem [Garey and Johnson, 1979, Chakaravarthy et al., 2011].



2 Troubleshooting Models and Strategies 

Bayesian networks for troubleshooting [Breese and Heckerman, 1996, Jensen et al., 2001]
contain variables representing

• faults,

• repair actions, called simply actions, and

• diagnostic actions, called questions.

Figure 1: Bayesian network for a troubleshooting model with single fault assumption and actions
conditionally independent given the faults. There are three faults $F_1$, $F_2$, $F_3$. To enforce the
single fault assumption, we use a fault variable $F$ with states $\{1, 2, 3\}$ and define the probability
tables for all $F_i$ so that $F_i = 1$ if and only if $F = i$. There are actions $A_1$, $A_2$, $A_3$, each
addressing one of the faults. There is a question $Q$ that can be used to discriminate between
$F_2$ and $F_3$.

Actions have only two possible outcomes - either the system is fixed after the action has been 
performed, or it remains in a faulty state. We assume that by performing the actions, we 
cannot introduce any new faults, and we know the outcome of any action immediately after its 
execution. Questions do not alter the state of the system, but may give useful information to 
direct the troubleshooting process. 

Each action or question has an associated cost. These costs do not change over time except
when stated otherwise. The actions and questions are idempotent [Ottosen, 2012] in the sense
that repeating a failed action does not fix the system, and repeating a question provides the 
same answer as the first time that the question was asked. 

Under the single fault assumption, there can be at most one fault present in the system at 
any moment of time. A simple troubleshooting model with the single fault assumption is shown 
in Figure 1.

A troubleshooting strategy [Vomlelová and Vomlel, 2003] is a policy governing the troubleshooting
process. An example of a troubleshooting strategy is in Figure 2. In general, a troubleshooting
strategy is a rooted directed tree with internal nodes labeled by actions and questions. Edges
are labeled by outcomes of the actions and questions. Each path from the root of the strategy
to one of the leaves corresponds to a possible troubleshooting session starting in the root
and terminating in the leaf. Failure nodes are all the leaf nodes for which the corresponding
troubleshooting session fails to fix the system. For a strategy $S$, we use this notation:




$\mathcal{L}(S)$        The set of leaves, also called terminal nodes.

$\mathcal{L}^-(S)$      The set of failure nodes, $\mathcal{L}^-(S) \subseteq \mathcal{L}(S)$.

$e_\ell$                The evidence (the outcomes of all actions and questions) compiled
                        along the path from the root $\vartheta$ of strategy $S$ to node $\ell$. Let $E(\vartheta, \ell)$
                        be the set of all the edges constituting the path from $\vartheta$ to $\ell$. Then
                        $e_\ell = \bigcup_{e \in E(\vartheta, \ell)} \mathrm{outcome}(e)$. An example is in Figure 2.

$P(e_\ell)$             The probability of reaching node $\ell$.

$t(\ell)$               The cost of performing all the actions and questions on the path from
                        the root of $S$ to node $\ell$.

$c_P$                   The penalty for not fixing the system.

[Figure 2 diagram: strategy tree. Evidence displayed at two failure nodes: $e = \{A_1 = 0, Q = 1, A_3 = 0\}$ and $e = \{A_1 = 0, Q = 0, A_2 = 0\}$.]

Figure 2: A troubleshooting strategy for the model in Figure 1. It contains actions $A_1$, $A_2$,
$A_3$ and a question $Q$. Nodes are labeled by actions and questions; edges are labeled by action
or question outcomes. According to this strategy, each troubleshooting session will begin with
action $A_1$. If it fails ($A_1 = 0$), we use question $Q$. Depending on the outcome of $Q$, we either
perform $A_2$ or $A_3$. Assume that all the $A_i$'s and $Q$ have unit cost, and the fault probabilities
are $P(F = 1) = \frac{1}{2}$, $P(F = 2) = \frac{1}{4}$, $P(F = 3) = \frac{1}{4}$. Further, assume that each action $A_i$ solves
the respective fault $F_i$ with certainty. Then, summing over the terminal nodes with positive
probability (the p's in the figure), we get $ECR = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 3 + \frac{1}{4} \cdot 3 = 2$. We display evidence $e$
for two of the failure nodes.



Since we assume that each repair action has only two possible outcomes, "1" (system fixed)
and "0" (system still in faulty state), the outdegree of nodes labeled by repair actions is exactly
two, with the edge labeled by "1" leading always to a terminal node in $\mathcal{L}(S) \setminus \mathcal{L}^-(S)$. The goal
is to construct a strategy $S$ minimizing the expected cost of repair

$$ECR(S) = \sum_{\ell \in \mathcal{L}(S)} P(e_\ell) \cdot t(\ell) + \sum_{\ell \in \mathcal{L}^-(S)} P(e_\ell) \cdot c_P. \tag{1}$$

Figure 2 gives an example of ECR evaluation.
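The evaluation of Formula (1) can be sketched in a few lines of Python. This is only an illustration: the function name `expected_cost_of_repair` and the leaf encoding `(probability, path_cost, is_failure)` are hypothetical, and the leaf probabilities 1/2, 1/4, 1/4 are assumed values chosen to be consistent with the ECR of 2 reported for the Figure 2 example.

```python
from fractions import Fraction


def expected_cost_of_repair(leaves, penalty=Fraction(0)):
    """Evaluate Formula (1) for a strategy given by its terminal nodes.

    leaves: iterable of (probability, path_cost, is_failure) triples.
    penalty: cost charged at failure nodes (hypothetical default of 0).
    """
    total = Fraction(0)
    for prob, path_cost, is_failure in leaves:
        total += prob * path_cost        # first sum: all terminal nodes
        if is_failure:
            total += prob * penalty      # second sum: failure nodes only
    return total


# Terminal nodes of the Figure 2 strategy under the assumed probabilities.
figure2_leaves = [
    (Fraction(1, 2), 1, False),  # A1 succeeds
    (Fraction(1, 4), 3, False),  # A1 fails, Q = 0, A2 succeeds
    (Fraction(1, 4), 3, False),  # A1 fails, Q = 1, A3 succeeds
    (Fraction(0), 3, True),      # failure node {A1 = 0, Q = 0, A2 = 0}
    (Fraction(0), 3, True),      # failure node {A1 = 0, Q = 1, A3 = 0}
]

figure2_ecr = expected_cost_of_repair(figure2_leaves)
print(figure2_ecr)  # 2
```

Because the actions in this example are perfect, the failure nodes have probability zero and the penalty term does not contribute.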

2.1 Troubleshooting without Questions 

When there are no questions, the troubleshooting strategy is just a sequence of repair actions
$A_1, \dots, A_n$. The actions are performed in order and the troubleshooting session continues until
the fault is fixed or all the actions have been used. In this paper, we assume the following.

Assumption 1 The available actions $A_1, \dots, A_n$ are sufficient to fix the fault, and we never
terminate the troubleshooting process before the system is fixed.¹

With Assumption 1, we can ignore the penalty term of (1), and the ECR of a sequence
$[A_1, \dots, A_n]$ can be computed by

$$ECR(A_1, \dots, A_n) = \sum_{i=1}^{n} P\Big(\bigcup_{j<i} \{A_j = 0\} \cup \{A_i = 1\}\Big) \cdot \sum_{j \le i} c(A_j), \tag{2}$$

that is, we multiply the cost of the first $i$ actions by the probability of fixing the fault with the
$i$-th action.

2.1.1 Basic Troubleshooting 

Finding optimal strategies in polynomial time is known to be possible under quite restrictive
assumptions [Jensen et al., 2001]:



Assumption 2 There is exactly one fault present in the system (the single fault assumption). 

Assumption 3 The actions are conditionally independent given the faults, and each action 
addresses exactly one fault. 

Assumption 4 The action costs are constant over time. 

Assumption 5 There are no questions. 

Scenarios conforming to these assumptions are called basic troubleshooting. An optimal strategy
for a basic troubleshooting scenario is computed by ordering the actions so that the sequence
of ratios $P(A_i = 1)/c(A_i)$ is non-increasing.² Due to the single fault assumption, Formula (2) becomes

$$ECR(A_1, \dots, A_n) = \sum_{i=1}^{n} P(A_i = 1) \cdot \sum_{j \le i} c(A_j).$$
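As a small illustration of the ratio rule for basic troubleshooting, the following Python sketch orders actions by non-increasing probability-to-cost ratio and compares the result with exhaustive search over all orderings. The function names and the three-action instance are hypothetical and serve only as an example; they are not taken from the literature cited above.

```python
from fractions import Fraction
from itertools import permutations


def ecr_basic(order, prob, cost):
    """ECR of a sequence under the basic assumptions:
    sum over i of P(A_i = 1) times the cost of the first i actions."""
    total, spent = Fraction(0), Fraction(0)
    for action in order:
        spent += cost[action]           # cumulative cost of the first i actions
        total += prob[action] * spent   # weighted by P(A_i = 1)
    return total


def p_over_c_order(prob, cost):
    """Order the actions so that the ratios P(A = 1) / c(A) are non-increasing."""
    return sorted(prob, key=lambda a: prob[a] / cost[a], reverse=True)


# Hypothetical instance: success probabilities sum to one (single fault).
prob = {"A1": Fraction(1, 2), "A2": Fraction(3, 10), "A3": Fraction(1, 5)}
cost = {"A1": Fraction(4), "A2": Fraction(1), "A3": Fraction(2)}

ratio_order = p_over_c_order(prob, cost)
best_order = min(permutations(prob), key=lambda o: ecr_basic(o, prob, cost))

print(ratio_order, ecr_basic(ratio_order, prob, cost))
```

On this instance the ratio ordering and the brute-force optimum coincide, as expected under Assumptions 2 to 5.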

2.1.2 Cost Clusters 

The requirement of constant action costs, mentioned in Section 2.1.1, can be violated when some
of the troubleshooting actions require common initialization or preparatory work. For example,
it is often necessary to disassemble the machine to perform certain actions when troubleshooting
a large piece of machinery. To model such situations, Langseth and Jensen [2001] proposed an
extension of the basic troubleshooting model, where the set of actions is partitioned into disjoint
subsets, called cost clusters. To access actions within a cluster $\mathcal{K}_i$, we have to pay an additional
cost to open the cluster, $c(\mathcal{K}_i)$. Once a cluster $\mathcal{K}_i$ is open, we can use actions from $\mathcal{K}_i$ at
any time, possibly mixed with actions from other open clusters. Ottosen and Jensen [2010]
have generalized the cost cluster scenario by allowing the cost clusters to form a tree and gave
a polynomial-time algorithm for finding optimal troubleshooting sequences. Ottosen [2012]
suggested a further generalization of the problem, where the cost clusters are allowed to form
an acyclic directed graph (DAG). A simple model with a cost cluster graph is shown in Figure 3.
We study models with cost cluster DAGs in Section 3.

The cost cluster scenarios discussed so far differ from the scenario where there are cost
clusters "without inside information" [Langseth and Jensen, 2001]. Without inside information,
we have to close the most recently open cluster whenever we need to check the outcome of
troubleshooting actions. The cost $c(\mathcal{K}_i)$ is paid whenever we open $\mathcal{K}_i$ again. Troubleshooting
cost clusters without inside information is NP-hard [Lín, 2011].

Figure 3: Troubleshooting with cost clusters. At the left side is a Bayesian network with faults
$F_i$ and actions $A_i$. At the right side is the graph of cost clusters. To access, say, action $A_2$, we
have to open clusters $C_y$ and $C_2$ (or $C_x$ and $C_2$). In general, a cost cluster may contain more
than one action.

¹ Note that this is a simplification. In real life, we quite often decide to buy a new device before we have
tried everything possible to fix the old one.

² This simple observation goes back to Bellman [1957] and Smith [1956].



3 Hardness of Approximation 

The results of this paper pertain to troubleshooting scenarios that do not satisfy the assumptions
of basic troubleshooting listed in Section 2.1.1. In Section 3.1, we formally define four simple
scenarios, each of them violating exactly one of the assumptions of basic troubleshooting. We
proceed to prove hardness of approximation for each of these scenarios in Section 3.2.



Preliminaries We review some terminology and notation used later on. An introduction to
computational complexity theory can be found in standard textbooks such as [Garey and Johnson,
1979, Arora and Barak, 2009].

Let $L$ be a minimization problem, let $x$ be an instance of the problem $L$, and let $A$ be an
algorithm for problem $L$. By $A(x)$ we denote the objective value returned by algorithm $A$ when
applied to instance $x$. By $\mathrm{opt}(x)$ we denote the optimum value of $x$, and by $|x|$ we denote the
size of instance $x$. Given a function $\rho : \mathbb{N} \to \mathbb{R}$ with $\rho(n) \ge 1$ for all $n$, we say that algorithm
$A$ is a polynomial $\rho(n)$-approximation algorithm if for all instances $x$ of problem $L$ we have
$A(x) \le \rho(|x|) \cdot \mathrm{opt}(x)$, and the algorithm $A$ operates in time polynomial in $|x|$.

We say that it is NP-hard to approximate problem $L$ within factor $\rho(n)$, if there is a
polynomial-time reduction $\phi$ from 3SAT to $L$, such that $\phi$ combined with a hypothetical
$\rho(n)$-approximation algorithm for $L$ would make it possible to decide 3SAT in polynomial time.
By NP-completeness of 3SAT, this would imply P=NP. Equivalently, we say that problem $L$
has no polynomial $\rho(n)$-approximation algorithm unless P=NP.

For some problems encountered in this paper, $\rho$ is just a constant. For another, we do not
state the function $\rho$ explicitly but only bound it using the big-$\Omega$ notation.



3.1 Statement of Results 

We state the results first and postpone the proofs to Section 3.2.



3.1.1 Troubleshooting Scenarios without Questions

We formally define simple troubleshooting scenarios, each of them breaking one of the
requirements of basic troubleshooting listed in Section 2.1.1. Each scenario isolates a single property that
makes it hard to approximate.

• Troubleshooting with dependent actions (Definition 1), breaking Assumption 3.

• Troubleshooting with multiple dependent faults (Definition 2), breaking Assumption 2.

• Troubleshooting with single fault, independent actions and cost clusters forming a DAG
(Definition 3), breaking Assumption 4.

We discuss properties of the scenarios in further detail in the remarks after the definitions. 

Definition 1 (Troubleshooting with dependent actions (TSDA)).

Input: A random variable $F$ with values (faults) $f_1, \dots, f_n$, uniform probability distribution
$P(F)$, a set of actions $\{A_1, \dots, A_k\}$. Each action fixes a subset of faults $F(A_i) \subseteq \{f_1, \dots, f_n\}$
with certainty and no other faults. Each fault is fixed by at least one action.

Objective: Find a linear ordering of actions minimizing the expected cost of repair.

Remark. In troubleshooting with dependent actions, Assumption 3 of basic troubleshooting
is violated. This happens when the actions solve multiple faults, and some faults are solved by
several actions.

Remark. In real situations, the probability distribution $P(F)$ is not necessarily uniform and the
actions do not fix faults with certainty, but may fail. This comment applies also to Definitions
2 and 3.

Theorem 1. The TSDA problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP.

Remark. Consider a greedy algorithm for troubleshooting with dependent faults described by
Jensen et al. [2001]. The algorithm builds the action sequence by iteratively picking a previously
unused action with the highest ratio $P(A \mid e)/c(A)$, where $P(A \mid e)$ is the probability that the action $A$
fixes the fault given that all the preceding actions have failed. The algorithm is called Updating
P-over-C in [Ottosen, 2012] because the probabilities $P(A \mid e)$ of unused actions need to be
updated in every step. Theorem 2.3 in [Kaplan et al., 2005] implies that the greedy algorithm
is in fact a polynomial-time 4-approximation algorithm for the following generalization of TSDA:

• The costs of actions are fixed but arbitrary.

• We take the single fault assumption, but the fault distribution $P(F)$ is not necessarily
uniform.

• The distributions $P(A_i \mid F(A_i))$ are arbitrary except that the probability $P(A_i = 1 \mid F(A_i) = 0)$
is zero.

The result stated in Theorem 1 is therefore tight.
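The greedy rule described in this remark can be sketched as follows for the deterministic single-fault setting of Definition 1. This is a minimal illustration under stated assumptions, not the implementation of the cited works: the function names, the dictionary encoding of actions, and the tie-breaking by action name are all hypothetical. The example instance uses the set system from the Figure 4 caption with unit costs.

```python
from fractions import Fraction
from itertools import permutations


def ecr(sequence, fixes, cost, fault_prob):
    """ECR of an action sequence when a single fault is present and each
    action fixes exactly the faults in fixes[action], with certainty."""
    total, spent, remaining = Fraction(0), Fraction(0), set(fault_prob)
    for action in sequence:
        spent += cost[action]
        newly_fixed = fixes[action] & remaining
        total += sum(fault_prob[f] for f in newly_fixed) * spent
        remaining -= newly_fixed
    return total


def updating_p_over_c(fixes, cost, fault_prob):
    """Greedy sketch: repeatedly pick the unused action with the highest
    ratio P(A fixes the fault | all previous actions failed) / c(A)."""
    remaining = set(fault_prob)   # faults still possible given the failures
    unused = sorted(fixes)        # assumed tie-breaking: by action name
    sequence = []
    while remaining:
        mass = sum(fault_prob[f] for f in remaining)

        def ratio(action):
            p = sum(fault_prob[f] for f in fixes[action] & remaining) / mass
            return p / cost[action]

        best = max(unused, key=ratio)
        sequence.append(best)
        unused.remove(best)
        remaining -= fixes[best]
    return sequence


# Instance built from U = {1, 2, 3, 4}, C = {{1, 2}, {2, 3}, {2, 4}}.
fixes = {"A12": {1, 2}, "A23": {2, 3}, "A24": {2, 4}}
cost = {action: 1 for action in fixes}
fault_prob = {u: Fraction(1, 4) for u in (1, 2, 3, 4)}

greedy = updating_p_over_c(fixes, cost, fault_prob)
greedy_ecr = ecr(greedy, fixes, cost, fault_prob)
optimum = min(ecr(p, fixes, cost, fault_prob) for p in permutations(fixes))
```

The sketch relies on every fault being fixed by at least one action, as Definition 1 requires; otherwise the loop would run out of unused actions.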

Definition 2 (Troubleshooting with dependent faults (TSDF)).

Input: A Bayesian network representing probability distribution $P(\mathcal{F})$, binary random variables
$F_1, \dots, F_n \in \mathcal{F}$ (faults), a set of actions $\{A_1, \dots, A_n\}$. For each fault, there is exactly one action
that fixes it with certainty. Each action fixes just one fault.

Objective: Find a linear ordering $\pi$ of actions that minimizes the ECR.



Remark. In troubleshooting with dependent faults, there can be multiple faults present in the
system at the same time. That violates Assumption 2 of basic troubleshooting. If the faults
occur independently, the problem is solvable in polynomial time [Srinivas, 1995]. If the faults
do not occur independently, the problem is NP-hard [Vomlelová, 2003].

Theorem 2. The TSDF problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP.

Remark. In Definition 2, there is no restriction on $P(\mathcal{F})$ except that it is represented by a
Bayesian network. Bayesian network inference is a hard problem in itself [Cooper, 1990]. In the
proof of Theorem 2 we construct a Bayesian network for which the inference is easy and yet
the troubleshooting problem is hard.

Definition 3 (Troubleshooting with DAG cost clusters (TSCC)).

Input: A random variable $F$ with values (faults) $f_1, \dots, f_n$, uniform probability distribution
$P(F)$, a set of actions $\{A_1, \dots, A_n\}$. For each fault, there is exactly one action that fixes it with
certainty. Each action fixes just one fault. There is an acyclic directed graph $(V, E)$ of cost
clusters. Vertices $v \in V$ represent cost clusters and edges $u \to v$ show that the cluster $v$ can
be accessed from $u$. Each action $A_i$ is assigned to exactly one cluster $v \in V$, and each cluster
contains zero or more actions. The cost of opening cluster $v$ is $c(v) > 0$.

Objective: Find a schedule of cluster opening and troubleshooting actions minimizing the
expected cost of repair.

Remark. As discussed in Section 2.1.2, Assumption 4 of basic troubleshooting is violated
when there are cost clusters of actions. When the clusters are allowed to form a DAG, we get a
hard problem. When the cost clusters form a tree, the problem is solvable in polynomial time
[Ottosen and Jensen, 2010].

Theorem 3. The TSCC problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP. The result holds even when the cost cluster graph is bipartite.

Theorem 4. TSCC is NP-complete.

3.1.2 Troubleshooting with Questions 

As in Section 3.1.1, we formally define a simple scenario that captures the complexity of using
questions in troubleshooting. In this case, the probability distribution on faults is not required
to be uniform, but the actions and questions are deterministic.

Definition 4 (Troubleshooting with questions (TSQ)).

Input: A random variable $F$ with values (faults) $f_1, \dots, f_n$, a probability distribution $P(F)$, a
set of actions $\{A_1, \dots, A_n\}$. For each fault, there is exactly one action that fixes it with certainty.
There are $m$ binary questions $Q_j : \{f_1, \dots, f_n\} \to \{0, 1\}$.

Objective: Find a troubleshooting strategy minimizing the expected cost of repair.

Remark. Troubleshooting with questions violates Assumption 5 of basic troubleshooting.

Remark. In realistic situations, we could have questions with more than two possible values.
Also, the questions could be nondeterministic.

Theorem 5. The TSQ problem has no polynomial $\Omega(\log n)$-approximation algorithm, where $n$ is
the number of faults, unless P=NP. For the special case of TSQ with the probability distribution
$P(F)$ uniform, there is no polynomial $(4 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$, unless
P=NP.



3.2 Proofs of Theorems 

In all the proofs, the sets are finite, the random variables are discrete, and the probabilities are 
rational numbers. 

3.2.1 Troubleshooting without Questions 

The starting point of all the reductions in this section is the Min-sum set cover, a
combinatorial problem that combines covering and sequencing. Easy reductions are possible since the
probability distributions and costs considered in Definitions 1, 2 and 3 are uniform.

Definition 5 (Min-sum set cover (MSSC)).

Input: A finite set $U$, a collection $\mathcal{C}$ of subsets of $U$, that is $\mathcal{C} \subseteq \{S : S \subseteq U\}$.

Objective: Find a linear ordering $\pi$ of $\mathcal{C}$ minimizing the function

$$\sigma_{U,\mathcal{C}}(\pi) = \sum_{u \in U} i_\pi(u),$$

where $i_\pi(u)$ is the index of the first set $S \in \mathcal{C}$ covering element $u$ under the ordering $\pi$.

Theorem 6 ([Feige et al., 2004]). MSSC has no polynomial $(4 - \varepsilon)$-approximation algorithm for
any $\varepsilon > 0$ unless P=NP.

Remark. In the proofs of Theorems 1, 2 and 3, we will consider only MSSC instances $(U, \mathcal{C})$ for
which a set cover exists, that is $U = \bigcup_{S \in \mathcal{C}} S$. We can make such an assumption without
loss of generality, since the proof of Theorem 6 works with set systems for which set covers are
guaranteed to exist [Feige et al., 2004].



For later use, we record two simple lemmas. 

Lemma 1. Consider an MSSC instance $(U, \mathcal{C})$ and an ordering $\pi$ of $\mathcal{C}$. Denote by $U_{\pi(i)}$ the
set of $u \in U$ first covered by the $i$-th set of $\pi$. Then

$$\sigma_{U,\mathcal{C}}(\pi) = \sum_{i=1}^{|\mathcal{C}|} |U_{\pi(i)}| \cdot i. \tag{3}$$

Proof. Straight from Definition 5. □

Lemma 2. Assume $U = \bigcup_{S \in \mathcal{C}} S$. Then $|U| \le \sigma_{U,\mathcal{C}}(\pi)$ for any ordering $\pi$.

Proof. A consequence of Lemma 1. □
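The two lemmas are easy to check numerically on small instances. The following Python sketch evaluates the min-sum set cover objective both directly from Definition 5 and via the grouping used in Lemma 1, and confirms the lower bound of Lemma 2 for every ordering. The function names are hypothetical, and the instance is the set system from the Figure 4 caption.

```python
from itertools import permutations


def first_cover_index(ordering, u):
    """1-based index of the first set in the ordering that contains u."""
    for i, s in enumerate(ordering, start=1):
        if u in s:
            return i
    raise ValueError(f"element {u!r} is not covered by any set")


def mssc_cost(universe, ordering):
    """Objective of Definition 5: sum of first-cover indices over all elements."""
    return sum(first_cover_index(ordering, u) for u in universe)


def mssc_cost_lemma1(universe, ordering):
    """Same objective computed as in Lemma 1: the number of elements first
    covered by the i-th set, multiplied by i, summed over i."""
    covered, total = set(), 0
    for i, s in enumerate(ordering, start=1):
        newly_covered = (set(s) & set(universe)) - covered
        total += len(newly_covered) * i
        covered |= newly_covered
    return total


universe = {1, 2, 3, 4}
collection = [frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4})]

costs = [mssc_cost(universe, order) for order in permutations(collection)]
print(costs)  # every ordering of this instance costs 7
```

Here every ordering first covers two elements, then one, then one, so the objective is 2·1 + 1·2 + 1·3 = 7 regardless of the order.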

In this section, we work with scenarios without questions and use Formula (2) for
computation of the ECR. For a linear ordering $\pi$ of actions $\{A_1, \dots, A_n\}$, we define

$$p_{\pi(i)} = P\Big(\bigcup_{j<i} \{A_{\pi(j)} = 0\} \cup \{A_{\pi(i)} = 1\}\Big).$$

Formula (2) becomes

$$ECR(A_{\pi(1)}, \dots, A_{\pi(n)}) = \sum_{i=1}^{n} p_{\pi(i)} \cdot \sum_{j \le i} c(A_{\pi(j)}). \tag{4}$$

Theorem 1. The TSDA problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP.

Proof. The idea of the proof is to show that a special case of troubleshooting with dependent
actions is equivalent to min-sum set cover.

An instance $(U, \mathcal{C})$ of the MSSC problem is reduced to an instance of TSDA as follows.
There is a fault variable $F$ with a set of values $\{f_u : u \in U\}$. Distribution $P(F)$ is uniform.
For each $S \in \mathcal{C}$ there is an action $A_S$. Action $A_S$ fixes fault $f_u$ with probability one if and
only if $u \in S$. All the actions have unit cost. For the ease of notation, the actions are indexed
by integers $1, \dots, k$, where $k = |\mathcal{C}|$. Given an arbitrary linear ordering $A_{\pi(1)}, \dots, A_{\pi(k)}$ of the
actions, denote by $F_{\pi(i)}$ the set of faults first covered by the $i$-th action of the ordering. Using
(4), we compute the expected cost of repair

$$ECR(A_{\pi(1)}, \dots, A_{\pi(k)}) = \sum_{i=1}^{k} \frac{|F_{\pi(i)}|}{|U|} \cdot i = \frac{1}{|U|} \sum_{i=1}^{k} |F_{\pi(i)}| \cdot i.$$

From Lemma 1 we have that the ECR multiplied by $|U|$ is equal to $\sigma_{U,\mathcal{C}}(\pi)$. Therefore any
approximation algorithm for TSDA could be used to approximate $(U, \mathcal{C})$ with the same
approximation ratio. □
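The identity at the heart of this reduction can be checked by brute force on a small instance. The sketch below, with hypothetical function names, computes the ECR of the constructed TSDA instance by averaging the repair cost over the uniformly distributed faults, and compares it with the min-sum set cover objective divided by $|U|$.

```python
from fractions import Fraction
from itertools import permutations


def mssc_cost(universe, ordering):
    """Sum over elements of the 1-based index of the first covering set."""
    return sum(
        next(i for i, s in enumerate(ordering, start=1) if u in s)
        for u in universe
    )


def tsda_ecr(universe, ordering):
    """ECR of the reduced TSDA instance: faults are uniform over the
    universe, one unit-cost action per set, performed in the given order
    until the (single) fault is fixed."""
    total = Fraction(0)
    for fault in universe:
        actions_performed = 0
        for action_set in ordering:
            actions_performed += 1
            if fault in action_set:      # this action fixes the fault
                break
        total += Fraction(actions_performed, len(universe))
    return total


universe = {1, 2, 3, 4, 5}
collection = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({4, 5}), frozenset({1, 5})]

identity_holds = all(
    tsda_ecr(universe, order) * len(universe) == mssc_cost(universe, order)
    for order in permutations(collection)
)
```

The check assumes that the collection covers the universe, in line with the remark following Theorem 6.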

Theorem 2. The TSDF problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP.

Proof. We use an idea by Vomlelová [2003] to transform the TSDA instance constructed in the
proof of Theorem 1 to an instance of TSDF. First, construct a Bayesian network with vertex
set

$$\{F\} \cup \{F_u\}_{u \in U} \cup \{A_S\}_{S \in \mathcal{C}}$$

and edge set

$$\{F \to F_u\}_{u \in U} \cup \{F_u \to A_S\}_{u \in S}.$$

An example of such a network is shown in the left side of Figure 4. Variable $F$ has states
$\{f_u\}_{u \in U}$ and a uniform probability distribution. All the other variables have deterministic
distributions of probability:

• The probability $P(F_u = 1 \mid F)$ equals one if $F = f_u$, otherwise it equals zero.

• The probability $P(A_S = 1 \mid \{F_u\}_{u \in S})$ equals one if at least one parent $F_u$ of $A_S$ has value
1, otherwise the probability equals zero.

We use the network to create an instance of the TSDF problem. Add to the network $k$ new
vertices $A'_S$ and edges $A_S \to A'_S$ (the dotted part of Figure 4). The new vertices are actions of
the TSDF instance, the original actions $A_S$ are now faults. The cost of each new action is one. The
optimal ECR of the original TSDA problem is the same as the optimal ECR of the constructed
TSDF problem. □

Theorem 3. The TSCC problem has no polynomial $(4 - \varepsilon)$-approximation algorithm for any
$\varepsilon > 0$ unless P=NP. The result holds even when the cost cluster graph is bipartite.

Proof. We again reduce min-sum set cover.

For each element $u \in U$, create a fault $f_u$ with probability $P(F = f_u) = 1/|U|$ and an action
$A_u$, solving exclusively $f_u$. Each action is contained in an associated cost cluster, $C_u$. The cost
of opening $C_u$ and performing $A_u$ equals one.

For each set $S \in \mathcal{C}$, create an empty cluster $C_S$ with cost $c$ ($c$ is a large positive constant to be
discussed later). Create directed edges from the $C_S$'s to the $C_u$'s according to set membership:
there is an edge $C_S \to C_u$ if and only if $S \ni u$. An example of a cost cluster model created
this way is in Figure 3.

Figure 4: Model with dependent actions for $U = \{1, 2, 3, 4\}$ and $\mathcal{C} = \{\{1, 2\}, \{2, 3\}, \{2, 4\}\}$. The
dotted edges and vertices show extension to a model with dependent faults.

In any optimal schedule, each cluster $C_u$ is opened right before performing $A_u$; therefore,
we shall not mention opening of the $C_u$'s in the rest of the proof. Any troubleshooting sequence
has to contain all the actions $A_u$. Their order is arbitrary, since their costs and probabilities
of success are uniform. Thus the ECR can be decomposed as a sum of the expected cost of
actions and the expected cost of opening "top-level" clusters $C_S$. Assume the clusters $\{C_S\}_{S \in \mathcal{C}}$
are indexed by integers $1, \dots, k$ and are opened in sequence $C_1, \dots, C_k$. Using (1), we get

$$ECR = \underbrace{\sum_{i=1}^{n} \frac{1}{n} \cdot i}_{\text{actions}} + \underbrace{\sum_{j=1}^{k} P(C_j) \cdot j \cdot c}_{\text{top-level clusters}},$$

where $n = |U|$, $k = |\mathcal{C}|$ and $P(C_j)$ is the probability that by opening $C_j$ we make accessible
an action that fixes the fault. In such a case, $C_j$ is the last cluster of the sequence that needs
to be open. We assume that once a cluster $C_S$ is open, we perform all the actions accessible
from $C_S$ except for those that have already been performed. Indeed, if we did not perform the
actions greedily after opening each cost cluster, the ECR would increase, because some of the
cost clusters could be opened needlessly. With this assumption, $P(C_j) = |F_{\pi(j)}|/n$, where $F_{\pi(j)}$
is the set of actions first made available by opening the $j$-th cluster. Let $\pi$ be some ordering
of $\mathcal{C}$ and let $\sigma_{U,\mathcal{C}}(\pi)$ be the value given by (3). Then there is a corresponding troubleshooting
sequence specified by ordering $\pi$ with

$$ECR(\pi) = \frac{n+1}{2} + c \cdot \frac{\sigma_{U,\mathcal{C}}(\pi)}{n}, \tag{5}$$

and the correspondence of MSSC solutions and TSCC solutions is one to one.
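The decomposition of the ECR into an action part and a top-level cluster part can be verified by simulating the schedule directly. The following Python sketch is an illustration under the assumptions of this proof (uniform faults, unit cost for opening an action cluster and performing its action, cost $c$ per top-level cluster, actions performed greedily after each opening); the function names and the value $c = 10$ are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def mssc_cost(universe, ordering):
    """Sum over elements of the 1-based index of the first covering set."""
    return sum(
        next(i for i, s in enumerate(ordering, start=1) if u in s)
        for u in universe
    )


def tscc_ecr(universe, ordering, c):
    """Simulate the schedule: open top-level clusters in the given order
    and, after each opening, perform all newly accessible actions."""
    n = len(universe)
    performed = []       # actions performed so far, in order
    repair_cost = {}     # fault u -> total cost paid when A_u fixes it
    opened = 0
    for s in ordering:
        opened += 1
        for u in sorted(set(s) - set(performed)):
            performed.append(u)
            # all earlier actions failed: pay for every opened top-level
            # cluster and every action performed so far
            repair_cost[u] = opened * c + len(performed)
    return sum(Fraction(repair_cost[u]) for u in universe) / n


universe = {1, 2, 3, 4}
collection = [frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4})]
c = 10
n = len(universe)

formula_matches = all(
    tscc_ecr(universe, order, c)
    == Fraction(n + 1, 2) + c * Fraction(mssc_cost(universe, order), n)
    for order in permutations(collection)
)
```

For the ordering as listed, the simulation gives repair costs 11, 12, 23 and 34 for the four faults, hence an ECR of 20, which agrees with $(n+1)/2 + c \cdot 7/4$.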

We conclude the proof by showing that for large values of $c$, the ratio $ECR(\pi)/ECR(\pi^*) \le
4 - \varepsilon$ implies

$$\frac{\sigma_{U,\mathcal{C}}(\pi)}{\sigma_{U,\mathcal{C}}(\pi^*)} \le 4 - \varepsilon. \tag{6}$$

By Theorem 6, this would imply P=NP. However, we have to make sure that the representation
of $c$ is not larger than a polynomial in $n$.

Let us denote any lower bound of $ECR(\pi^*)$ by $\underline{ECR}(\pi^*)$, and use (5) to express $\sigma_{U,\mathcal{C}}(\pi)$ as
$(ECR(\pi) - \frac{n+1}{2}) \cdot \frac{n}{c}$. With a little algebra, one can rewrite (6) and verify that it is implied by

$$\frac{ECR(\pi)}{ECR(\pi^*)} \le 4 - \varepsilon - \frac{2(n+1)}{\underline{ECR}(\pi^*)}. \tag{7}$$

We set $c = (2k - \frac{1}{2})(n+1)$, $k > 0$, and use (5) and Lemma 2 to obtain $\underline{ECR}(\pi^*) = \frac{n+1}{2} + c =
2k(n+1)$. We use the $\underline{ECR}$ in inequality (7) and get the implication

$$\frac{ECR(\pi)}{ECR(\pi^*)} \le 4 - \varepsilon - \frac{1}{k} \implies \frac{\sigma_{U,\mathcal{C}}(\pi)}{\sigma_{U,\mathcal{C}}(\pi^*)} \le 4 - \varepsilon.$$

Now we are almost done: all we need to show is that $\frac{1}{k}$ can be arbitrarily close to zero and
yet the number of bits needed to encode $c$ as a binary number is polynomial in $n$. Let $k$ equal
$2^{a(n)}$, where $a(n)$ is an arbitrary polynomial in $n$, and let $|c|$ be the length of $c$ in bits. We
check that $|c|$ remains polynomial in $n$:

$$|c| = O\big(\log_2 \big((2k - \tfrac{1}{2})(n+1)\big)\big) = O(\log_2 k + \log_2 n) = O(a(n)) \quad \text{when } k = 2^{a(n)}.$$

□

Proof of Theorem 4. NP-completeness is a concept defined for decision problems. Therefore,
we have to consider the decision variant of TSCC: given an arbitrary positive constant $K$ and
an instance $x$ of TSCC, is it true that $\mathrm{opt}(x) \le K$? This decision problem belongs to class
NP: once we guess the strategy $s(x)$, it is easy to compute the ECR in polynomial time and
check $ECR(s(x)) \le K$. To show NP-hardness, we can use the same reduction as in the proof of
Theorem 3 (ending with Equation (5)). □

3.2.2 Troubleshooting with Questions

We reduce the Decision tree problem to TSQ (Definition 4).

Definition 6 (Binary decision tree (DT) [Garey and Johnson, 1979]).

Input: A set $\mathcal{E} = \{e_1, \dots, e_n\}$ of entities with a probability distribution $P(\mathcal{E})$; we assume that
all the probabilities are non-zero. A set $\mathcal{T} = \{T_1, \dots, T_m\}$ of functions $T_j : \mathcal{E} \to \{0, 1\}$ called
tests.

Objective: Construct a decision tree for $\mathcal{E}$ using the tests in $\mathcal{T}$ with minimal weighted external
path length (defined below).

A decision tree is a rooted binary tree with leaves labeled by $e_i \in \mathcal{E}$ and non-leaf vertices
labeled by $T_j \in \mathcal{T}$. If an entity passes a test, $T_j(e_i) = 1$, it follows the right branch, otherwise
it follows the left branch. It is assumed that for each pair of distinct entities $e_r$, $e_s$, there exists
a test $T_j$ with $T_j(e_r) = 1$ and $T_j(e_s) = 0$. Therefore, the tests can be used to correctly identify
each entity, and we require that a path from the root of the decision tree to a leaf uniquely
determines the entity $e_i$ that labels the leaf.

The weighted external path length of tree $T$ is $W(T) = \sum_{i=1}^{n} P(e_i) \, d(e_i)$, where $d(e_i)$ is the length
of the path from the root of $T$ to the leaf labeled by $e_i$.
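The weighted external path length is straightforward to compute for a concrete tree. The sketch below uses a hypothetical encoding that is not part of the problem definition: a leaf is an entity name, and an internal vertex is a tuple `(test_name, left_subtree, right_subtree)`, with the right branch taken when the test returns 1. The three-entity instance is likewise made up for illustration.

```python
from fractions import Fraction


def leaf_depths(tree, depth=0):
    """Map each leaf label to the length of its root-to-leaf path."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    _test, left, right = tree
    depths = leaf_depths(left, depth + 1)
    depths.update(leaf_depths(right, depth + 1))
    return depths


def weighted_external_path_length(tree, prob):
    """W(T): sum over entities of P(e) times the depth of the leaf of e."""
    return sum(prob[e] * d for e, d in leaf_depths(tree).items())


def classify(tree, entity, tests):
    """Follow the tree for a given entity: right branch on 1, left on 0."""
    while isinstance(tree, tuple):
        test, left, right = tree
        tree = right if tests[test](entity) else left
    return tree


# Hypothetical instance: three entities, one indicator test per entity.
prob = {"e1": Fraction(1, 2), "e2": Fraction(1, 4), "e3": Fraction(1, 4)}
tests = {
    "T1": lambda e: int(e == "e1"),
    "T2": lambda e: int(e == "e2"),
    "T3": lambda e: int(e == "e3"),
}
tree = ("T1", ("T2", "e3", "e2"), "e1")

w = weighted_external_path_length(tree, prob)
print(w)  # 3/2
```

Placing the most probable entity closest to the root keeps $W(T)$ small; here $e_1$ sits at depth 1 and the other two entities at depth 2.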



The DT problem is known to be NP-hard [Hyafil and Rivest, 1976]. As for inapproximability,
we have a recently proven theorem:

Theorem 7 ([Chakaravarthy et al., 2011]). The DT problem has no polynomial $\Omega(\log n)$-approximation
algorithm, where $n$ is the number of entities in the input, unless P=NP. For the special case of
DT with the probability distribution $P(\mathcal{E})$ uniform, there is no polynomial $(4 - \varepsilon)$-approximation
algorithm for any $\varepsilon > 0$, unless P=NP.

Troubleshooting with questions is at least as hard to approximate as DT:

Theorem 5. The TSQ problem has no polynomial $\Omega(\log n)$-approximation algorithm, where $n$ is
the number of faults, unless P=NP. For the special case of TSQ with the probability distribution
$P(F)$ uniform, there is no polynomial $(4 - \varepsilon)$-approximation algorithm for any $\varepsilon > 0$, unless
P=NP.

Proof. We reduce DT to TSQ as follows:

• For each entity $e_i \in \mathcal{E}$, there is a corresponding fault $f_i$, $P(f_i) = P(e_i)$.

• There are questions $Q_1, \dots, Q_m$ corresponding to the tests, $P(Q_j = 1 \mid f_i = 1) = T_j(e_i)$.

• The questions have unit cost.

• We take the single fault assumption, and for each fault $f_i$, there is exactly one perfect
action $A_i$ solving the fault with uniform cost $c_A$ (a large enough constant to be specified
later).

The idea of the reduction is to set the cost of action $c_A$ high enough so that it is never profitable
to perform an action before a fault is completely identified by the questions. By Lemma 3 below,
a cost of action with this property is $c_A = 1/\min_{e \in \mathcal{E}} P(e)$. Note that the size of representation of
$c_A$ is not an issue here, since $P(e)$ is part of the input parameters representing the DT problem.
Using Lemma 3, we can restrict our attention to troubleshooting strategies that first identify
the fault using a decision tree for $(\mathcal{E}, \mathcal{T})$ and perform a single action afterwards. For decision
tree $T$, the ECR of the corresponding troubleshooting strategy is

$$ECR(T) = W(T) + c_A.$$

Unless P = NP, Theorem 7 implies that for any approximation algorithm, there will be input
instances of the Decision tree problem with approximation ratio

$$\frac{W(T)}{W(T^*)} > r \cdot \log n$$

for some (fixed) constant $r$ and large $n$. A little rearrangement of the above inequality shows
that any approximation algorithm for TSQ will have approximation ratio $ECR(T)/ECR(T^*) >
r' \cdot \log n$, for some $r' < r$:

$$W(T) + c_A > W(T^*) \cdot r \cdot \log n,$$

$$\frac{W(T) + c_A}{W(T^*) + c_A} > \frac{W(T^*)}{W(T^*) + c_A} \cdot r \cdot \log n = \frac{1}{1 + \frac{c_A}{W(T^*)}} \cdot r \cdot \log n,$$

$$\frac{ECR(T)}{ECR(T^*)} > \frac{r}{1 + c_A} \cdot \log n, \quad \text{since } W(T^*) \ge 1.$$

□
The proof of Theorem 5 relies on Lemma 3.

Lemma 3. Consider the instance of TSQ constructed in the proof of Theorem 5 above, and
denote by $e$ the evidence at any particular step of a troubleshooting strategy. Then:

• When there is an action $A$ with probability of success $0 < P(A = 1 \mid e) < 1$, there is also
at least one question $Q$ with probability $0 < P(Q = 1 \mid e) < 1$.

• Moreover, if the uniform cost of action is $c_A = 1/\min_i P(f_i)$, then performing the
question $Q$ immediately before the action $A$ leads to a strategy with lower expected cost
than performing $A$ immediately before $Q$.

Proof. The reduction from the DT problem to TSQ ensures that when there is an action $A$ with
probability of success $\mathcal{P}(A = 1 \mid e) < 1$, then:

• There are at least two distinct faults $f_i$ and $f_j$, $i \neq j$, with nonzero probability.

• Action $A$ solves exactly one of the faults $f_i$ and $f_j$.

• There is a question $Q$ that can "distinguish" between the two faults, that is, $\mathcal{P}(Q = 1 \mid e, f_i)$
is either 0 or 1, and $\mathcal{P}(Q = 1 \mid e, f_j)$ equals $1 - \mathcal{P}(Q = 1 \mid e, f_i)$.

We want to show that it is always better to perform $Q$ before $A$. Consider two strategies:

• Strategy $S_1$ performs action $A$ first, and if it fails, it performs question $Q$.

• Strategy $S_2$ performs $Q$ first and then performs $A$ in one of its branches.

The two strategies are shown in Figure [5]. Since action $A$ addresses just one fault, there are two
cases to consider. Either

a) $\mathcal{P}(A = 1, Q = 1 \mid e) = 0$, or

b) $\mathcal{P}(A = 1, Q = 0 \mid e) = 0$.

In case a), strategy $S_2$ schedules the action $A$ in the 0-branch emanating from $Q$, as shown at
the right hand side of Figure [5]. In case b), strategy $S_2$ schedules the action $A$ in the 1-branch.
In the rest of the proof, we consider case a); case b) is symmetric. We denote the cost
of $Q$ by $c_Q$, and the sub-trees of the strategies by $t_0$ and $t_1$. We use the notation introduced in
Section [2]. For a leaf node $\ell$, we denote by $e_\ell$ the associated evidence. By $t(\ell)$ we denote the
cost of reaching $\ell$ from the root of $t_0$ (or $t_1$). The expected costs of the two strategies $S_1$ and
$S_2$ are

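$$
\begin{aligned}
\mathrm{ECR}(S_1 \mid e) ={}& \mathcal{P}(A = 1 \mid e) \cdot c_A \\
&+ \sum_{\ell \in \mathcal{L}(t_1)} \mathcal{P}(e_\ell, A = 0, Q = 1 \mid e) \cdot \left[c_A + c_Q + t(\ell)\right] \\
&+ \sum_{\ell \in \mathcal{L}(t_0)} \mathcal{P}(e_\ell, A = 0, Q = 0 \mid e) \cdot \left[c_A + c_Q + t(\ell)\right], \\
\mathrm{ECR}(S_2 \mid e) ={}& \sum_{\ell \in \mathcal{L}(t_1)} \mathcal{P}(e_\ell, Q = 1 \mid e) \cdot \left[c_Q + t(\ell)\right] \\
&+ \mathcal{P}(Q = 0, A = 1 \mid e) \cdot \left[c_Q + c_A\right] \\
&+ \sum_{\ell \in \mathcal{L}(t_0)} \mathcal{P}(e_\ell, A = 0, Q = 0 \mid e) \cdot \left[c_A + c_Q + t(\ell)\right].
\end{aligned}
$$

Figure 5: The two strategies from the proof of Lemma [3] (left: strategy $S_1$, rooted at action $A$;
right: strategy $S_2$, rooted at question $Q$). Strategy $S_2$ is constructed from strategy $S_1$ by
interchanging action $A$ and question $Q$.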

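To have strategy $S_2$ more efficient than strategy $S_1$, we need $\mathrm{ECR}(S_2 \mid e) - \mathrm{ECR}(S_1 \mid e) < 0$ to
hold. Since the probability $\mathcal{P}(A = 1, Q = 1 \mid e)$ is zero, most of the terms conveniently cancel
out when we perform the subtraction:

$$
\begin{aligned}
0 &> \mathrm{ECR}(S_2 \mid e) - \mathrm{ECR}(S_1 \mid e) \\
&= c_A \cdot \underbrace{\left[\mathcal{P}(Q = 0, A = 1 \mid e) - \mathcal{P}(A = 1 \mid e)\right]}_{-\mathcal{P}(A = 1, Q = 1 \mid e) = 0}
 + c_Q \cdot \underbrace{\mathcal{P}(Q = 0, A = 1 \mid e)}_{\mathcal{P}(A = 1 \mid e)} \\
&\quad + \sum_{\ell \in \mathcal{L}(t_1)} \underbrace{\left[\mathcal{P}(e_\ell, Q = 1 \mid e) - \mathcal{P}(e_\ell, A = 0, Q = 1 \mid e)\right]}_{\mathcal{P}(e_\ell, A = 1, Q = 1 \mid e) = 0} \cdot \left[c_Q + t(\ell)\right] \\
&\quad - c_A \cdot \underbrace{\sum_{\ell \in \mathcal{L}(t_1)} \mathcal{P}(e_\ell, A = 0, Q = 1 \mid e)}_{\mathcal{P}(Q = 1 \mid e)} \\
&= c_Q \cdot \mathcal{P}(A = 1 \mid e) - c_A \cdot \mathcal{P}(Q = 1 \mid e). \qquad (8)
\end{aligned}
$$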
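The cancellation can be checked numerically. The sketch below evaluates the two expected-cost formulas for case a) on a small hand-made example and compares their difference with the right-hand side of (8). The function names, the representation of a sub-tree as a list of `(probability, t)` pairs, and all numbers are assumptions chosen for illustration only.

```python
from fractions import Fraction as F

def ecr_s1(p_a, leaves_t1, leaves_t0, c_a, c_q):
    """Strategy S1: action A first, then question Q if A failed.

    Each leaf is a pair (P(e_l, A=0, Q=q | e), t(l)).
    """
    total = p_a * c_a
    for prob, t in leaves_t1 + leaves_t0:
        total += prob * (c_a + c_q + t)
    return total

def ecr_s2(p_a, leaves_t1, leaves_t0, c_a, c_q):
    """Strategy S2 in case a): Q first, A only in the 0-branch."""
    total = p_a * (c_q + c_a)      # P(Q=0, A=1 | e) = P(A=1 | e) in case a)
    for prob, t in leaves_t1:      # A is never performed in the 1-branch
        total += prob * (c_q + t)
    for prob, t in leaves_t0:      # A failed, then the sub-tree t0 follows
        total += prob * (c_a + c_q + t)
    return total

c_a, c_q = F(5), F(1)
p_a = F(1, 2)                      # P(A = 1 | e)
leaves_t1 = [(F(3, 10), F(5))]     # leaves below the answer Q = 1
leaves_t0 = [(F(1, 5), F(5))]      # leaves below the answer Q = 0
p_q = sum(prob for prob, _ in leaves_t1)   # P(Q = 1 | e) in case a)

s1 = ecr_s1(p_a, leaves_t1, leaves_t0, c_a, c_q)
s2 = ecr_s2(p_a, leaves_t1, leaves_t0, c_a, c_q)
rhs = c_q * p_a - c_a * p_q        # right-hand side of (8)
print(s1, s2, s2 - s1 == rhs)      # 8 7 True
```

In this example the difference is $-1$, so asking $Q$ first is cheaper, as the lemma claims.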



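To finish the proof, we claim that the value $c_A = 1/\min_{i=1}^{n} \mathcal{P}(f_i)$ satisfies inequality (8), since:

• As we have argued, $\mathcal{P}(Q = 1 \mid e) > 0$. By the construction in the proof of Theorem [5], we
have $\mathcal{P}(Q = 1 \mid e) \ge \min_{i=1}^{n} \mathcal{P}(f_i)$.

• By the construction in the proof of Theorem [5], we also have $c_Q = 1$.

• We assume $\mathcal{P}(A = 1 \mid e) < 1$.

Together, these give $c_A \cdot \mathcal{P}(Q = 1 \mid e) \ge 1 > c_Q \cdot \mathcal{P}(A = 1 \mid e)$, so the right-hand side of (8) is negative.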
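As a sanity check of this choice of $c_A$, the following sketch enumerates all evidence states of a three-fault example and confirms that the right-hand side of (8) is negative in each of them. It is only an illustration under assumptions: evidence is modelled as the set of faults still possible (with priors renormalised), the question is deterministic given the fault, and the name `rhs_of_inequality_8` and the numbers are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def rhs_of_inequality_8(priors, alive, solved, q_outcome, c_q=1):
    """c_Q * P(A=1|e) - c_A * P(Q=1|e) under the single fault assumption.

    priors: prior fault probabilities P(f_i).
    alive: indices of faults still consistent with the evidence e.
    solved: index of the fault repaired by action A.
    q_outcome: q_outcome[i] is the 0/1 answer of Q when fault i is present.
    """
    c_a = 1 / min(priors)                  # c_A = 1 / min_i P(f_i)
    mass = sum(priors[i] for i in alive)   # P(e), used to renormalise
    p_a = priors[solved] / mass            # P(A = 1 | e)
    p_q = sum(priors[i] for i in alive if q_outcome[i] == 1) / mass
    return c_q * p_a - c_a * p_q

priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
q_outcome = [0, 1, 1]   # case a): Q answers 0 on the fault solved by A
solved = 0

values = []
for size in (2, 3):
    for alive in combinations(range(3), size):
        if solved in alive:   # otherwise P(A = 1 | e) = 0
            values.append(rhs_of_inequality_8(priors, alive, solved, q_outcome))

print(all(v < 0 for v in values))  # True
```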

□



4 Summary and Discussion 



The results are summarized in Table [1]. As discussed in Section [3.1.1], the result for troubleshooting
with dependent actions is tight for a restricted, but relevant variant of the problem. For the
other scenarios, no approximation algorithms have been designed so far. Therefore the results
reported might not be tight.

Table 1: Summary of results

Troubleshooting scenario       Hard to approximate within
Dependent actions (TSDA)       $4 - \varepsilon$ for all $\varepsilon > 0$
Dependent faults (TSDF)        $4 - \varepsilon$ for all $\varepsilon > 0$
DAG cost clusters (TSCC)       $4 - \varepsilon$ for all $\varepsilon > 0$
Questions (TSQ)                $\Omega(\log n)$



As we have shown in Section [3.2.1], the min-sum set cover problem [Feige et al., 2004] is
equivalent to a special case of troubleshooting with dependent actions. A generalized variant
of min-sum set cover is studied under the name pipelined set cover by [Munagala et al., 2005].
There, the sets (corresponding to faults) have associated weights (corresponding to probabilities
of faults), and the elements (corresponding to repair actions) have associated costs. The
problem is further generalized by [Kaplan et al., 2005]. All three papers [Feige et al., 2004,
Munagala et al., 2005, Kaplan et al., 2005] provide insights and results that are directly
applicable to the greedy algorithm Updating P-over-C [Jensen et al., 2001] discussed in this paper in
Section [3.1].



Problems solvable in polynomial time. A special case of troubleshooting with questions is
implicitly addressed already by [Johnson, 1956]. He gives an algorithm that runs in $O(n \log n)$ time
and is optimal under reasonable assumptions. Likewise, a very restricted case of troubleshooting
with dependent actions is solvable by graph matching [Vomlelova and Vomlel, 2003] in $O(n\sqrt{m})$
time [Micali and Vazirani, 1980], where $m$ is the number of faults and $n$ is the number of actions.
Identification of interesting special cases solvable in polynomial time is an open research area.